EnerNOC GreenButton Data is a subset of the Open EnerNOC data repository. The raw data was provided by the EnerNOC electricity supplier and contains anonymous 5-minute electricity consumption data for 100 commercial/industrial sites for the year 2012. The simplified version contains data at 30-minute intervals.
The explanatory variables are:
Load Necessary Packages
library(feather)
library(data.table)
library(mgcv)
## Loading required package: nlme
## This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
library(car)
## Loading required package: carData
library(ggplot2)
library(dplyr, warn.conflicts = FALSE)
Read the Data and show the overview of the data.
DT <- as.data.table(read_feather("DT_4_ind"))
str(DT)
## Classes 'data.table' and 'data.frame': 70080 obs. of 5 variables:
## $ date_time: POSIXct, format: "2012-01-02 00:00:00" "2012-01-02 00:30:00" ...
## $ value : num 1590 1564 1560 1585 1604 ...
## $ week : chr "Monday" "Monday" "Monday" "Monday" ...
## $ date : Date, format: "2012-01-02" "2012-01-02" ...
## $ type : chr "Commercial Property" "Commercial Property" "Commercial Property" "Commercial Property" ...
## - attr(*, ".internal.selfref")=<externalptr>
Plot the Data
ggplot(data = DT, aes(x = date, y = value)) +
geom_line() +
facet_grid(type ~ ., scales = "free_y") +
theme(panel.border = element_blank(),
panel.background = element_blank(),
panel.grid.minor = element_line(colour = "grey90"),
panel.grid.major = element_line(colour = "grey90"),
panel.grid.major.x = element_line(colour = "grey90"),
axis.text = element_text(size = 10),
axis.title = element_text(size = 12, face = "bold"),
strip.text = element_text(size = 9, face = "bold")) +
labs(x = "Date", y = "Load (kW)")
We can see that the electricity consumption of the category
Food Sales & Storage is not affected by weekdays or
weekends.
We use the car::record() function to easily describe the
relationship between the electricity consumption and each day of the
week. We do this by appending a new column that associates each day of
the week with a unique number
DT[, week_num := as.integer(car::recode(week,
"'Monday'='1';'Tuesday'='2';'Wednesday'='3';'Thursday'='4';
'Friday'='5';'Saturday'='6';'Sunday'='7'"))]
unique(DT[, week])
## [1] "Monday" "Tuesday" "Wednesday" "Thursday" "Friday" "Saturday"
## [7] "Sunday"
unique(DT[, week_num])
## [1] 1 2 3 4 5 6 7
We extract information related to “industry”, “data”, “week” and
“period” from the dataset. Since the data was collected every half an
hour, there’re 48 consecutive observations within a day, thus we have
period <- 48.
n_type <- unique(DT[, type])
n_date <- unique(DT[, date])
n_weekdays <- unique(DT[, week])
period <- 48
We select the electricity consumption of a commercial building, store
it as the variable data_r and plot the data.
type == n_type[1] stands for “Commercial Property”,
whereas date %in% n_date[57:70] corresponds to two
weeks.
data_r <- DT[(type == n_type[1] & date %in% n_date[57:70])]
ggplot(data_r, aes(date_time, value)) +
geom_line() +
theme(panel.border = element_blank(),
panel.background = element_blank(),
panel.grid.minor = element_line(colour = "grey90"),
panel.grid.major = element_line(colour = "grey90"),
panel.grid.major.x = element_line(colour = "grey90"),
axis.text = element_text(size = 10),
axis.title = element_text(size = 12, face = "bold")) +
labs(x = "Date", y = "Load (kW)")
We re-organize the data in accordance with the change of days and weeks.
N <- nrow(data_r) # number of rows in the training set
window <- N / period # number of days in the training set
matrix_gam <- data.table(Load = data_r[, value],
Daily = rep(1:period, window),
Weekly = data_r[, week_num])
head(matrix_gam)
## Load Daily Weekly
## <num> <int> <int>
## 1: 1630.875 1 1
## 2: 1611.201 2 1
## 3: 1657.160 3 1
## 4: 1653.042 4 1
## 5: 1702.316 5 1
## 6: 1648.411 6 1
We use the mgcv:gam() function to build the GAM model.
The periodic change in days is described by a “cubic regression spline”,
whereas the periodic change in weeks is described by “P-splines”.
gam_1 <- gam(Load ~ s(Daily, bs = "cr", k = period) +
s(Weekly, bs = "ps", k = 7),
data = matrix_gam,
family = gaussian)
Inspect the summary of the model.
summary(gam_1)$r.sq
## [1] 0.7718406
summary(gam_1)$sp.criterion
## GCV.Cp
## 245544.9
GCV is an indicator of the fit of the model. The lower
this value is, the fitter the model is. In addition we can see that
R-sq is not high, which indicates the bad performance of
the model.
Compare the difference between the prediction and the reality over those two weeks.
matrix_gam$Predict=gam_1$fitted.values
ggplot(matrix_gam[1:nrow(matrix_gam),], aes(1:nrow(matrix_gam)))+
labs("lab")+
geom_line(aes(y=Load, color="Real"), size = 0.8)+
geom_line(aes(y = Predict, color = "Predict"), size = 0.8)
It doesn’t look nice. We carefully inspect the electricity consumption during the first week.
row_Mon <- nrow(matrix_gam)/2
matrix_gam$Predict=gam_1$fitted.values
ggplot(matrix_gam[1:row_Mon,], aes(1:row_Mon))+
labs("lab")+
geom_line(aes(y=Load, color="Real"), size = 0.8)+
geom_line(aes(y = Predict, color = "Predict"), size = 0.8)
This model can only predict the trend of the weekdays’ electricity consumption, but fails in predicting the exact amount of electricity consumption. We do a detailed inspection on the electricity consumption on Monday.
row_Mon <- nrow(matrix_gam)/14
matrix_gam$Predict=gam_1$fitted.values
ggplot(matrix_gam[1:row_Mon,], aes(1:row_Mon))+
labs("lab")+
geom_line(aes(y=Load, color="Real"), size = 0.8)+
geom_line(aes(y = Predict, color = "Predict"), size = 0.8)
The problem is that the real electricity consumption at the end of the day is not at the same level as it at the beginning of the day, but the model shows a pure periodic change within the day, which is different from the reality. Thus we need to change our mindset and build another model.
This time we use the method of interaction between different scale
and build a model by combining Daily and
Weekly.
gam_2 <- gam(Load ~ s(Daily, Weekly),
data = matrix_gam,
family = gaussian)
summary(gam_2)$r.sq
## [1] 0.9352108
summary(gam_2)$sp.criterion
## GCV.Cp
## 71162.37
According to R.sq and p-value, we can say
that this model performs better.
Plot the difference between predicted result and the real result.
datas <- rbindlist(list(data_r[, .(value, date_time)],
data.table(value = gam_2$fitted.values,
data_time = data_r[, date_time])))
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
datas[, type := c(rep("Real", nrow(data_r)), rep("Fitted", nrow(data_r)))]
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
geom_line(size = 0.8) +
theme_bw() +
labs(x = "Time", y = "Load (kW)",
title = "Fit from GAM n.2")
We can see that the fit during Monday to Thursday significantly improved.
Next, we use another advanced method of interaction and use a smooth function called “tensor product”.
gam_3 <- gam(Load ~ te(Daily, Weekly,
bs = c("cr", "ps")),
data = matrix_gam,
family = gaussian)
summary(gam_3)$r.sq
## [1] 0.9268452
summary(gam_3)$sp.criterion
## GCV.Cp
## 79724.9
Plot gam_3
datas <- rbindlist(list(data_r[, .(value, date_time)],
data.table(value = gam_3$fitted.values,
data_time = data_r[, date_time])))
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
datas[, type := c(rep("Real", nrow(data_r)), rep("Fitted", nrow(data_r)))]
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
geom_line(size = 0.8) +
theme_bw() +
labs(x = "Time", y = "Load (kW)",
title = "Fit from GAM n.3")
We can make it even better. For example let the knots (a concept that is similar to dimension) fit the periodic change of days and weeks better.
gam_4 <- gam(Load ~ te(Daily, Weekly,
k = c(period, 7),
bs = c("cr", "ps")),
data = matrix_gam,
family = gaussian)
summary(gam_4)$r.sq
## [1] 0.9727604
summary(gam_4)$sp.criterion
## GCV.Cp
## 34839.46
We can see that R-sq has increased a little bit. But the
most significant is edf, which has increased 5 times.
Plot gam_4
datas <- rbindlist(list(data_r[, .(value, date_time)],
data.table(value = gam_4$fitted.values,
data_time = data_r[, date_time])))
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
datas[, type := c(rep("Real", nrow(data_r)), rep("Fitted", nrow(data_r)))]
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
geom_line(size = 0.8) +
theme_bw() +
labs(x = "Time", y = "Load (kW)",
title = "Fit from GAM n.4")
All right, what about be greedier and combine all the previous
models? Let’s examine our thought by building gam_5
gam_5 <- gam(Load ~ s(Daily, bs = "cr", k = period) +
s(Weekly, bs = "ps", k = 7) +
ti(Daily, Weekly,
k = c(period, 7),
bs = c("cr", "ps")),
data = matrix_gam,
family = gaussian)
summary(gam_5)$r.sq
## [1] 0.9717469
summary(gam_5)$sp.criterion
## GCV.Cp
## 35772.35
Although p-value remains \(0\), but R-sq decreased and
GCV increased. This means that the performance is not as
good as the previous gam_4 model.
Plot gam_5
datas <- rbindlist(list(data_r[, .(value, date_time)],
data.table(value = gam_5$fitted.values,
data_time = data_r[, date_time])))
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
datas[, type := c(rep("Real", nrow(data_r)), rep("Fitted", nrow(data_r)))]
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
geom_line(size = 0.8) +
theme_bw() +
labs(x = "Time", y = "Load (kW)",
title = "Fit from GAM n.5")
Now is our last attempt. Here we add another method of tensor product
interations and introduce a stricter condition by setting
full = TRUE.
gam_6 <- gam(Load ~ t2(Daily, Weekly,
k = c(period, 7),
bs = c("cr", "ps"),
full = TRUE),
data = matrix_gam,
family = gaussian)
summary(gam_6)$r.sq
## [1] 0.9738273
summary(gam_6)$sp.criterion
## GCV.Cp
## 32230.68
Plot gam_6
datas <- rbindlist(list(data_r[, .(value, date_time)],
data.table(value = gam_6$fitted.values,
data_time = data_r[, date_time])))
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
datas[, type := c(rep("Real", nrow(data_r)), rep("Fitted", nrow(data_r)))]
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
geom_line(size = 0.8) +
theme_bw() +
labs(x = "Time", y = "Load (kW)",
title = "Fit from GAM n.6")
This plot looks even better.
With so many models, how to decide which one is the best? Just ask
the omnipotent AIC.
AIC(gam_1, gam_2, gam_3, gam_4, gam_5, gam_6)
## df AIC
## gam_1 17.46996 10248.993
## gam_2 30.70080 9415.768
## gam_3 25.65709 9492.545
## gam_4 121.41166 8912.611
## gam_5 115.80849 8932.746
## gam_6 100.12005 8868.628
Apparently gam_4, gam_5, gam_6
are on the leading board. gam_6 has the best performance,
while gam_4 comes in second.
Next we plot gam_4, gam_6 together and see
what we found.
layout(matrix(1:2, nrow = 1))
plot(gam_4, rug = FALSE, se = FALSE, n2 = 80, main = "gam n.4 with te()")
plot(gam_6, rug = FALSE, se = FALSE, n2 = 80, main = "gam n.6 with t2()")
These contour lines indicate each model’s response on
Weekly and Daily. Although they look similar,
the contour of gam_6 has more wave-like patterns. This is
an indication of its higher sensitivity.
Before the end of this section, let’s see what we can do to make the
plot of gam_6 looks better. Firstly we use the
vis.gam function from the package mgcv.
# vis.gam(gam_6, main = "t2(D, W)", plot.type = "contour",
# color = "terrain", contour.col = "black", lwd = 2)
vis.gam(gam_6, main = "t2(D, W)",
color = "terrain", contour.col = "black", lwd = 2)
We can see that the electricity consumption on weekdays are higher than on weekends. The peak hours are arround 3 pm from Monday to Thursday.
Without using the contour.col option, we can make a 3D
plot.
vis.gam(gam_6, n.grid = 50, theta = 35, phi = 32, zlab = "",
ticktype = "detailed", color = "topo", main = "t2(D, W)")
Change the viewing angle
vis.gam(gam_6, n.grid = 50, theta = 190, phi = 20, zlab = "",
ticktype = "detailed", color = "topo", main = "t2(D, W)")
Now let’s see what would happen if we discard explanatory variables one by one.
gam_6D <- gam(Load ~ t2(Daily,
k = period,
bs = "cr",
full = TRUE),
data = matrix_gam,
family = gaussian)
summary(gam_6D)$r.sq
## [1] 0.5129567
summary(gam_6D)$sp.criterion
## GCV.Cp
## 518169.2
Em, it doesn’t look good.
Now let’s discard variable Daily
gam_6W <- gam(Load ~ t2(Weekly,
k = 7,
bs = "ps",
full = TRUE),
data = matrix_gam,
family = gaussian)
summary(gam_6W)$r.sq
## [1] 0.2502016
summary(gam_6W)$sp.criterion
## GCV.Cp
## 791454.6
We can see it looks much worse.
Let’s maintain a rigorous attitude and use ANOVA to compare the
difference between these three leading models. Firstly we discard
variable Weekly and see what will happen
anova(gam_6, gam_6D, test="F")
## Analysis of Deviance Table
##
## Model 1: Load ~ t2(Daily, Weekly, k = c(period, 7), bs = c("cr", "ps"),
## full = TRUE)
## Model 2: Load ~ t2(Daily, k = period, bs = "cr", full = TRUE)
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 550.77 15740821
## 2 661.13 339050831 -110.37 -323310010 106.61 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Then we discard variable Daily and see what will
happen
anova(gam_6, gam_6W, test="F")
## Analysis of Deviance Table
##
## Model 1: Load ~ t2(Daily, Weekly, k = c(period, 7), bs = c("cr", "ps"),
## full = TRUE)
## Model 2: Load ~ t2(Weekly, k = 7, bs = "ps", full = TRUE)
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 550.77 15740821
## 2 667.88 526095056 -117.11 -510354235 158.6 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The result is clear – non of the variable Weekly and
variable Daily can be dropped!
Plot the result of discarding variable Weekly.
datas <- rbindlist(list(data_r[, .(value, date_time)],
data.table(value = gam_6D$fitted.values,
data_time = data_r[, date_time])))
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
datas[, type := c(rep("Real", nrow(data_r)), rep("Fitted", nrow(data_r)))]
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
geom_line(size = 0.8) +
theme_bw() +
labs(x = "Time", y = "Load (kW)",
title = "Fit from GAM n.6")
We can see that there is no weekly difference in the electricity
consumption when the variable Weekly is dropped.
Plot the result of discarding variable Daily.
datas <- rbindlist(list(data_r[, .(value, date_time)],
data.table(value = gam_6W$fitted.values,
data_time = data_r[, date_time])))
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
datas[, type := c(rep("Real", nrow(data_r)), rep("Fitted", nrow(data_r)))]
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
geom_line(size = 0.8) +
theme_bw() +
labs(x = "Time", y = "Load (kW)",
title = "Fit from GAM n.6")
We can see that there is no difference in the electricity consumption
over the 24 hours of a day when the variable Daily is
dropped.
Lastly, the most exciting part – let’s predict the electricity consumption for the next two weeks.
data_test <- DT[(type == n_type[1] & date %in% n_date[71:84])]
matrix_test <- data.table(Load = data_test[, value],
Daily = rep(1:period, window),
Weekly = data_test[, week_num])
pred_week <- predict(gam_6, matrix_test[1:(7*period)],interval="confidence", level = 0.95)
datat <- rbindlist(list(data_test[, .(value, date_time)],
data.table(value = pred_week,
data_time = data_test[, date_time])))
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
datat[, type := c(rep("Real", nrow(data_test)), rep("Predicted", nrow(data_test)))]
ggplot(data = datat, aes(date_time, value, group = type, colour = type)) +
geom_line(size = 0.8) +
theme_bw() +
labs(x = "Time", y = "Load (kW)",
title = "Predicted result on GAM n.6")
Predict the electricity consumption of the next month.
data_test <- DT[(type == n_type[1] & date %in% n_date[71:98])]
matrix_test <- data.table(Load = data_test[, value],
Daily = rep(1:period, window),
Weekly = data_test[, week_num])
pred_week <- predict(gam_6, matrix_test[1:(7*period)],interval="confidence", level = 0.95)
datat <- rbindlist(list(data_test[, .(value, date_time)],
data.table(value = pred_week,
data_time = data_test[, date_time])))
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
datat[, type := c(rep("Real", nrow(data_test)), rep("Predicted", nrow(data_test)))]
ggplot(data = datat, aes(date_time, value, group = type, colour = type)) +
geom_line(size = 0.8) +
theme_bw() +
labs(x = "Time", y = "Load (kW)",
title = "Predicted result on GAM n.6")
Additional content that was added at the end of the semester.
Define the MAPE function.
mape <- function(real, pred){
return(100 * mean(abs((real - pred)/real)))
}
Define the criteria for model evaluation (R-sq, GCV, MAPE).
gam_eval <- function(model){
return(data.table(RSQ=summary(model)$r.sq,
GCV=summary(model)$sp.criterion,
MAPE=mape(data_new[, value], model$fitted.values)))
}
Define the function for plotting models.
gam_plot <- function(model, title){
datas <- rbindlist(list(data_new[, .(value, date_time)],
data.table(value = model$fitted.values,
data_time = data_new[, date_time])))
datas[, type := c(rep("Real", nrow(data_new)), rep("Fitted", nrow(data_new)))]
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
geom_line(size = 0.8) +
theme_bw() +
labs(x = "Time", y = "Load (kW)",
title = title)
}
Use the first season’s data to train a model.
data_new <- DT[(type == n_type[1] & date %in% n_date[1:91])]
ggplot(data_new, aes(date_time, value)) +
geom_line() +
theme(panel.border = element_blank(),
panel.background = element_blank(),
panel.grid.minor = element_line(colour = "grey90"),
panel.grid.major = element_line(colour = "grey90"),
panel.grid.major.x = element_line(colour = "grey90"),
axis.text = element_text(size = 10),
axis.title = element_text(size = 12, face = "bold")) +
labs(x = "Date", y = "Load (kW)")
Re-organize the data and add information about the electricity consumption of the previous one day and the previous week.
matrix_new <- data.table(Load = data_new[, value],
PrevDayLoad = c(data_new[1:48, value], data_new[1:4320, value]),
PrevWeekLoad = c(data_new[1:336, value], data_new[1:4032, value]),
Daily = rep(1:period, window),
Weekly = data_r[, week_num])
## Warning in as.data.table.list(x, keep.rownames = keep.rownames, check.names =
## check.names, : Item 4 has 672 rows but longest item has 4368; recycled with
## remainder.
## Warning in as.data.table.list(x, keep.rownames = keep.rownames, check.names =
## check.names, : Item 5 has 672 rows but longest item has 4368; recycled with
## remainder.
Build the first model.
gam_new_1 <- gam(Load ~ s(Daily, bs = "cr", k = period) +
s(Weekly, bs = "ps", k = 7),
data = matrix_new,
family = gaussian)
Evaluate the first model.
eval_1 <- gam_eval(gam_new_1)
eval_1
## RSQ GCV MAPE
## <num> <num> <num>
## 1: 0.7194149 311287.5 18.89976
Plot the first model.
gam_plot(gam_new_1, "Fit from gam_new_1")
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
Build the second model.
gam_new_2 <- gam(Load ~ s(Daily, Weekly),
data = matrix_new,
family = gaussian)
Evaluate the second model.
eval_2 <- gam_eval(gam_new_2)
eval_2
## RSQ GCV MAPE
## <num> <num> <num>
## 1: 0.8487935 168039.3 10.62365
Plot the second model.
gam_plot(gam_new_2, "Fit from gam_new_2")
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
Build the third model.
gam_new_3 <- gam(Load ~ te(Daily, Weekly,
bs = c("cr", "ps")),
data = matrix_new,
family = gaussian)
Evaluate the third model.
eval_3 <- gam_eval(gam_new_3)
eval_3
## RSQ GCV MAPE
## <num> <num> <num>
## 1: 0.8423209 175028.3 10.18985
Plot the third model.
gam_plot(gam_new_3, "Fit from gam_new_3")
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
Build the fourth model.
gam_new_4 <- gam(Load ~ te(Daily, Weekly,
k = c(period, 7),
bs = c("cr", "ps")),
data = matrix_new,
family = gaussian)
Evaluate the fourth model.
eval_4 <- gam_eval(gam_new_4)
eval_4
## RSQ GCV MAPE
## <num> <num> <num>
## 1: 0.8796766 135799.2 8.319575
Plot the fourth model.
gam_plot(gam_new_4, "Fit from gam_new_4")
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
Build the fifth model.
gam_new_5 <- gam(Load ~ s(Daily, bs = "cr", k = period) +
s(Weekly, bs = "ps", k = 7) +
ti(Daily, Weekly,
k = c(period, 7),
bs = c("cr", "ps")),
data = matrix_new,
family = gaussian)
Evaluate the fifth model.
eval_5 <- gam_eval(gam_new_5)
eval_5
## RSQ GCV MAPE
## <num> <num> <num>
## 1: 0.8801601 135306 8.337419
Plot the fifth model.
gam_plot(gam_new_5, "Fit from gam_new_5")
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
Build the sixth model.
gam_new_6 <- gam(Load ~ t2(Daily, Weekly,
k = c(period, 7),
bs = c("cr", "ps"),
full = TRUE),
data = matrix_new,
family = gaussian)
Evaluate the sixth model.
eval_6 <- gam_eval(gam_new_6)
eval_6
## RSQ GCV MAPE
## <num> <num> <num>
## 1: 0.8802681 134768.2 8.32655
Plot the sixth model.
gam_plot(gam_new_6, "Fit from gam_new_6")
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
Build the seventh model – without using smooth function of GAM.
gam_new_simple <- gam(Load ~ Daily+Weekly,
data = matrix_new,
family = gaussian)
Evaluate the seventh model.
eval_7 <- gam_eval(gam_new_simple)
eval_7
## RSQ GCV MAPE
## <num> <num> <num>
## 1: 0.4302044 629325.7 25.61359
Plot the seventh model.
gam_plot(gam_new_simple, "Fit from gam_new_simple")
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
Build the eighth model – only apply te smooth function on the
variable Daily.
gam_new_te_daily <- gam(Load ~ te(Daily, bs = "cr", k = period) +Weekly,
data = matrix_new,
family = gaussian)
Evaluate the eighth model.
eval_8 <- gam_eval(gam_new_te_daily)
eval_8
## RSQ GCV MAPE
## <num> <num> <num>
## 1: 0.6366094 402558.4 20.44199
Plot the eighth model.
gam_plot(gam_new_te_daily, "Fit from gam_new_te_daily")
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
Build the nineth model – only apply te smooth function on the
variable Weekly.
gam_new_te_weekly <- gam(Load ~ te(Weekly, bs = "ps", k = 7) +Daily,
data = matrix_new,
family = gaussian)
Evaluate the nineth model.
eval_9 <- gam_eval(gam_new_te_weekly)
eval_9
## RSQ GCV MAPE
## <num> <num> <num>
## 1: 0.5124041 539142 24.279
Plot the nineth model.
gam_plot(gam_new_te_weekly, "Fit from gam_new_te_weekly")
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
Use a table to analyse the nine models.
eval_table <- bind_rows(eval_1, eval_2, eval_3, eval_4, eval_5, eval_6, eval_7, eval_8, eval_9)
all_aic <- AIC(gam_new_1, gam_new_2, gam_new_3, gam_new_4, gam_new_5, gam_new_6, gam_new_simple, gam_new_te_daily, gam_new_te_weekly)$AIC
eval_table[, AIC :=all_aic]
eval_table
## RSQ GCV MAPE AIC
## <num> <num> <num> <num>
## 1: 0.7194149 311287.5 18.899761 67646.26
## 2: 0.8487935 168039.3 10.623654 64953.21
## 3: 0.8423209 175028.3 10.189855 65131.27
## 4: 0.8796766 135799.2 8.319575 64020.79
## 5: 0.8801601 135306.0 8.337419 64004.82
## 6: 0.8802681 134768.2 8.326550 63987.99
## 7: 0.4302044 629325.7 25.613586 70721.15
## 8: 0.6366094 402558.4 20.441993 68769.43
## 9: 0.5124041 539142.0 24.278996 70045.54
Predict the electricity consumption of season 2 using the fourth model.
data_test_2qt <- DT[(type == n_type[1] & date %in% n_date[92:183])]
matrix_test_2qt <- data.table(Load = data_test_2qt[, value],
Daily = rep(1:period, 91),
Weekly = data_test_2qt[, week_num])
pred_2qt <- predict(gam_new_4, matrix_test_2qt[1:(7*period)],interval="confidence", level = 0.95)
datat <- rbindlist(list(data_test_2qt[, .(value, date_time)],
data.table(value = pred_2qt,
data_time = data_test_2qt[, date_time])))
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
datat[, type := c(rep("Real", nrow(data_test_2qt)), rep("Predicted", nrow(data_test_2qt)))]
ggplot(data = datat, aes(date_time, value, group = type, colour = type)) +
geom_line(size = 0.8) +
theme_bw() +
labs(x = "Time", y = "Load (kW)",
title = "Predicted result on GAM n.6")
Predict the electricity consumption of season 3 using the fourth model.
data_test_3qt <- DT[(type == n_type[1] & date %in% n_date[183:274])]
matrix_test_3qt <- data.table(Load = data_test_3qt[, value],
Daily = rep(1:period, 91),
Weekly = data_test_3qt[, week_num])
pred_3qt <- predict(gam_new_4, matrix_test_3qt[1:(7*period)],interval="confidence", level = 0.95)
datat <- rbindlist(list(data_test_3qt[, .(value, date_time)],
data.table(value = pred_3qt,
data_time = data_test_3qt[, date_time])))
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
datat[, type := c(rep("Real", nrow(data_test_3qt)), rep("Predicted", nrow(data_test_3qt)))]
ggplot(data = datat, aes(date_time, value, group = type, colour = type)) +
geom_line(size = 0.8) +
theme_bw() +
labs(x = "Time", y = "Load (kW)",
title = "Predicted result on GAM n.6")
Predict the electricity consumption of season 4 using the fourth model.
data_test_4qt <- DT[(type == n_type[1] & date %in% n_date[274:365])]
matrix_test_4qt <- data.table(Load = data_test_4qt[, value],
Daily = rep(1:period, 91),
Weekly = data_test_4qt[, week_num])
pred_4qt <- predict(gam_new_4, matrix_test_4qt[1:(7*period)],interval="confidence", level = 0.95)
datat <- rbindlist(list(data_test_4qt[, .(value, date_time)],
data.table(value = pred_4qt,
data_time = data_test_4qt[, date_time])))
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
datat[, type := c(rep("Real", nrow(data_test_4qt)), rep("Predicted", nrow(data_test_4qt)))]
ggplot(data = datat, aes(date_time, value, group = type, colour = type)) +
geom_line(size = 0.8) +
theme_bw() +
labs(x = "Time", y = "Load (kW)",
title = "Predicted result on GAM n.6")
Compute the MAPE of the predictions.
mape_2qt <- mape(matrix_test_2qt[1:(7*period)]$Load, pred_2qt)
mape_3qt <- mape(matrix_test_3qt[1:(7*period)]$Load, pred_3qt)
mape_4qt <- mape(matrix_test_4qt[1:(7*period)]$Load, pred_4qt)
mapes <- cbind(mape_2qt, mape_3qt, mape_4qt)
mapes
## mape_2qt mape_3qt mape_4qt
## [1,] 12.01464 14.89136 12.93795